Extracting Bilingual Collocations from Non-Aligned Parallel Corpora
Abstract
This paper proposes a new method for finding correspondences of uninterrupted collocations in Japanese-English bilingual corpora without sentence-to-sentence alignment. Uninterrupted English collocations such as “once again”, “give up”, or “gross national product”, which are handled as a single word or a compound word in Japanese, can be automatically extracted together with their corresponding Japanese words using word co-occurrence frequencies in both corpora. The method consists of two stages. First, English and Japanese collocations are extracted separately from the given corpora. Successive word units, which become collocation candidates, are collected using n-gram statistics of each word; two kinds of entropy values, after-unit and before-unit, are then calculated for each unit, and the word units surpassing the thresholds are selected as uninterrupted collocations. Second, a corresponding translation of each uninterrupted English collocation is extracted from the Japanese corpus by calculating correlation values between the target collocation and the Japanese words or collocations that co-occur with it in the given corpora, with the help of a basic English-to-Japanese word unit dictionary. Experiments were carried out on economic articles from the Asahi Newspaper. A Japanese word unit for each extracted English collocation was obtained automatically with a precision of more than 70%, whereas the precision was about 40% when only word-to-word correspondence was used.
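The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the thresholds, the use of Shannon entropy over neighbouring words, and the choice of the Dice coefficient as the "correlation value" are all assumptions made for illustration. The abstract does not name the correlation measure or the unit over which co-occurrence is counted, so the sketch assumes co-occurrence is counted over paired articles.

```python
import math
from collections import Counter, defaultdict


def branching_entropies(tokens, n):
    """Return {ngram: (before_entropy, after_entropy)} for all n-grams in tokens."""
    before = defaultdict(Counter)  # words observed immediately before each unit
    after = defaultdict(Counter)   # words observed immediately after each unit
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        unit = tuple(tokens[i:i + n])
        counts[unit] += 1
        if i > 0:
            before[unit][tokens[i - 1]] += 1
        if i + n < len(tokens):
            after[unit][tokens[i + n]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        probs = [c / total for c in counter.values()]
        return sum(-p * math.log2(p) for p in probs)

    return {u: (entropy(before[u]), entropy(after[u])) for u in counts}, counts


def select_collocations(tokens, n, min_count=2, threshold=1.0):
    """Keep units that are frequent and have varied contexts on both sides.

    min_count and threshold are hypothetical values, not taken from the paper.
    """
    entropies, counts = branching_entropies(tokens, n)
    return {
        unit
        for unit, (h_before, h_after) in entropies.items()
        if counts[unit] >= min_count and h_before >= threshold and h_after >= threshold
    }


def dice(docs_a, docs_b):
    """Dice coefficient between two sets of article ids (assumed correlation measure)."""
    if not docs_a or not docs_b:
        return 0.0
    return 2 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))


tokens = "we give up now they give up later you give up soon".split()
collocations = select_collocations(tokens, n=2)
```

In this toy input, "give up" is preceded and followed by three different words, so both of its entropies are log2(3), while every other bigram has a fixed neighbour and an entropy of zero. A high entropy on both sides suggests that the unit boundary is a natural one, which is the intuition behind thresholding the before-unit and after-unit values.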
Related papers
Extracting collocations and their translations from parallel corpora
Identifying collocations in a text (e.g., break record) and correctly translating them (battre record vs. *casser record) represent key issues in machine translation, notably because of their prevalence in language and their syntactic flexibility. This article describes a method for discovering translation equivalents for collocations from parallel corpora, aimed at increasing the lexical cover...
Learning Bilingual Collocations by Word-Level Sorting
This paper proposes a new method for learning bilingual collocations from sentence-aligned parallel corpora. Our method comprises two steps: (1) extracting useful word chunks (n-grams) by word-level sorting and (2) constructing bilingual collocations by combining the word-chunks acquired in stage (1). We apply the method to a very challenging text pair: a stock market bulleti...
A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora
We present two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora based on arrival distances of words ...
Collocational Translation Memory Extraction Based on Statistical and Linguistic Information
In this paper, we propose a new method for extracting bilingual collocations from a parallel corpus to provide phrasal translation memories. The method integrates statistical and linguistic information to achieve effective extraction of bilingual collocations. The linguistic information includes parts of speech, chunks, and clauses. The method involves first obtaining an extended list of Englis...
Collocation Extraction Using Monolingual Word Alignment Method
Statistical bilingual word alignment has been well studied in the context of machine translation. This paper adapts the bilingual word alignment algorithm to a monolingual scenario to extract collocations from a monolingual corpus. The monolingual corpus is first replicated to generate a parallel corpus, where each sentence pair consists of two identical sentences in the same language. Then the mon...